Europäisches Patentamt
European Patent Office
Office européen des brevets





(11) Publication number: 0 646 858 A1



EUROPEAN PATENT APPLICATION 



(21) Application number : 94306322.2 
(22) Date of filing : 26.08.94



(51) Int. Cl.6 : G06F 3/06, G06F 13/40,
G06F 11/20 



(30) Priority : 07.09.93 US 124653 

(43) Date of publication of application :
05.04.95 Bulletin 95/14 

(84) Designated Contracting States :
DE FR GB 



(71) Applicant : AT & T GLOBAL INFORMATION
SOLUTIONS INTERNATIONAL INC. 
1700 South Patterson Boulevard 
Dayton, Ohio 45479 (US) 



(72) Inventor : DuLac, Keith Bernard
8652 Hila 

Derby, KS 67037 (US) 

(74) Representative : Robinson, Robert George 
International Patent Department, 
AT&T GIS Limited, 
915 High Road, 
North Finchley 
London N12 8QJ (GB) 






(54) Data storage system architecture. 

(57) A data storage system comprises a matrix of intelligent storage nodes interconnected to
communicate with each other via a network of busses (R0-Rm, C0-Cn). The network of busses
includes a plurality of first busses (R0-Rm) for conducting data from and to a corresponding
plurality of host system processors (H0-Hm) and a plurality of second busses (C0-Cn), each one of
the second busses intersecting with each one of the first busses. The nodes are located at each
intersection. The storage nodes each include a data storage device (D), such as a magnetic disk
drive unit, a processor (P) and buffer memory (B1-B3), whereby the node processor controls
the storage and retrieval of data at the node as well as being capable of co-ordinating the
storage and retrieval of data at other nodes within the network.

This invention relates to data storage systems.

Disk array storage systems are known, which include a plurality of disk drives which operate in parallel 
and appear to the host system as a single large disk drive. Numerous disk array design alternatives are pos- 
sible, incorporating a few to many disk drives. Several array alternatives, each possessing different attributes, 
benefits and shortcomings, are presented in an article titled "A Case for Redundant Arrays of Inexpensive Disks
(RAID)" by David A Patterson, Garth Gibson and Randy H. Katz; University of California Report No. UCB/CSD 
87/391, December 1987. This article discusses disk arrays and the improvements in performance, reliability, 
power consumption and scalability that disk arrays provide in comparison to single large magnetic disks. Five 
different storage configurations are discussed, referred to as RAID levels 1 to 5, respectively. 
Complex storage management techniques are required in order to coordinate the operation of the multitude
of data storage devices within an array to perform read and write functions, parity generation and checking, 
data restoration and reconstruction, and other necessary or optional operations. Array operation can be man- 
aged by a dedicated hardware controller constructed to control array operations. However, the data storage 
performance achievable using this technique is limited. 
It is an object of the present invention to provide a data storage system which enables a high data storage
performance to be achieved. 

Therefore, according to one aspect of the present invention, there is provided a data storage system
including a network of nodes, interconnected to communicate with each other via a plurality of busses,
characterized in that each one of said nodes includes: a data storage device connected to receive data from and
provide data to at least one of said plurality of busses; and a node processor connected to said at least one of
said busses for controlling the storage and retrieval of data at said data storage device, said node processor
being capable of controlling the storage and retrieval of data at data storage devices associated with additional
nodes of said plurality of nodes through communications between said node processor and said additional
nodes via said plurality of busses.
According to another aspect of the present invention, there is provided a method for transferring data
between a host processor and a matrix of data storage nodes, each node including a data storage device and
control logic for coordinating data storage operations for a plurality of data storage nodes, characterized by
the step of selecting any one of said data storage nodes to control the transfer of data between said host
processor and a first subset of said matrix of data storage nodes.
One embodiment of the present invention will now be described by way of example, with reference to the
accompanying drawings, in which:

Figure 1 is a diagrammatic illustration of a data storage system including a plurality of disk drives and
inexpensive processors located within a matrix network, constructed in accordance with the present
invention; and

Figure 2 is a block diagram showing the processor, disk drive, and associated elements located within each
node of the matrix network illustrated in Figure 1.

Referring now to Figures 1 and 2, there is seen a data storage system in accordance with the present
invention. The architecture shown in Figure 1 includes a host processor connection block 12 providing connection
to one or more host system processors, not shown. The host processors are identified by reference numerals
H0, H1, H2, ... Hm. Connection block 12 couples host processors H0, H1, H2, ... Hm to a network 14 of data storage
nodes. Network 14 includes several busses, R0 through Rm, arranged in rows, each bus connecting one of host
processors H0 through Hm with a group of storage nodes. Network 14 further includes several busses, C0
through Cn, arranged in columns. A node is formed at every intersection between a row and column bus. The
nodes are identified by pairs of coordinates, the first coordinate referring to the number of the row bus to which
the node connects, and the second coordinate identifying the column bus to which the node connects. The network
includes nodes from (0, 0), at the intersection of busses R0 and C0, through (m, n), at the intersection of busses
Rm and Cn.

H0 in the configuration shown is connected directly to storage nodes (0, 0) through (0, n) through bus R0.
In addition, H0 is provided access to all the storage nodes on bus C0, i.e., nodes (1, 0) through (m, 0), by passing
through node (0, 0). Nodes (0, 1) through (0, n) similarly provide access for processor H0 to the nodes on busses
C1 through Cn, respectively. Each one of host processors H1 through Hm has direct access to all the storage
nodes on busses R1 through Rm, respectively, and access through interconnecting nodes to all the storage
nodes on network 14.
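
The access rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `route` and the string labels are hypothetical, host Hi is assumed to attach to row bus Ri, and a node on another row is assumed to be reached through the node on the host's own row that shares the target's column bus.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (row, column) coordinates, as used in the description


def route(host: int, target: Node) -> List[str]:
    """Return the busses and nodes traversed from host H<host> to the target node.

    Hypothetical helper: assumes host H<i> is attached to row bus R<i>.
    """
    row, col = target
    path = [f"H{host}", f"R{host}"]
    if row != host:
        # The target is not on the host's row bus: pass through the
        # interconnecting node on the host's row, then along the column bus.
        path += [f"({host},{col})", f"C{col}"]
    path.append(f"({row},{col})")
    return path


if __name__ == "__main__":
    print(route(0, (0, 3)))  # direct access over R0
    print(route(0, (2, 0)))  # access through node (0,0) and column bus C0
```

For example, H0 reaches node (2, 0) by way of R0, node (0, 0) and C0, which matches the description of access to the nodes on bus C0.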

Host processor connection block 12 may include logic for executing group array algorithms, such as the
RAID algorithms that are necessary for issuing I/O operations, handling error exception conditions, and
performing data reconstruction when a storage device in network 14 fails. Other functions of the logic included
in connection block 12 may include diagnostic and group algorithm initialization executed in response to input
provided by a system administrator. In a high performance configuration, a host processor connection block
will exist for every row bus (R0 through Rm) that is shown in node network 14. The high performance
configuration allows multiple I/O commands and data to flow over the attached row busses simultaneously. In a lower
performance, lower cost configuration, commands and data flow over one row bus.

Each of storage nodes (0, 0) through (m, n) includes a storage device, node processor, buffers and
interface logic as shown in Figure 2, which is a block diagram of the processor, disk drive, and associated elements
located within node (m, n) of network 14.

Node (m, n) is seen to include an interface I/F 1 to column bus Cn, a second interface I/F 2 to row bus Rm,
an inexpensive processor P, data buffers B1, B2, I and B3, and a storage element D, such as a Head Disk
Assembly (HDA) for storing and retrieving data. Node processor P and data buffers B1, B2, I and B3 are
connected to interface I/F 1 and thereby to network bus Cn by a node bus identified as BUS 1. A second bus,
identified as BUS 2, provides connection between node processor P and data buffers B1, B2, I and B3 and interface
I/F 2, which thereby provides connection to network bus Rm. Read/write buffer B3 also provides the node
connection to storage element D. Nodes (0, 0) through (m, n) are similarly constructed.

Node processor P, in a conventional sense, controls the network protocols, buffer management, error
recovery and storage media control such as head positioning, data encoding/decoding and defect handling. A
typical example of the network node could be a Small Computer System Interface (SCSI) disk drive.

In operation, array storage requests are received from one or more host processors and directed to
designated nodes within network 14 for execution. An exemplary array operation could be for H0 to issue a RAID
level 5 write operation. It will be appreciated that in a RAID level 5 configuration, data and parity information
are distributed over a plurality of disks. The command is formed in a packetized mode for serial connections,
or in handshake mode for parallel connections, and issued to appropriate nodes over bus R0. H0 could issue
a write to any desired node (0, 0) to (0, n) residing on bus R0. The node that receives the command will be referred
to in the discussion which follows as the primary node. Remaining network nodes will be referred to as
secondary nodes. The command contains information about secondary nodes that will be involved in subsequent
read/write operations emanating from the primary node. These operations are necessary to complete the RAID
level 5 write command. The primary node upon receiving a command takes responsibility for the operation if
no error conditions occur. The primary node will report status conditions to the appropriate host processors
for irregular conditions.

The data storage system described above permits the distribution of the compute power necessary to
execute the array algorithms and functions to the nodes of a generalized network. The network can consist of
intelligent disk drives such that the array algorithms and most common functions are executed at the array
nodes.

The host system is relieved of many of the array storage operations. Additionally, several array requests
may be executed concurrently, each request being processed by a different primary node. The system thereby
realizes increased performance beyond the capabilities of a storage system employing a single hardware
controller.

The two main attributes of the described system are:

1. An increase in performance, because each node contains sufficient processing power to relieve either the
host processor or the hardware array processor; and
2. Relief of the bandwidth bottleneck of the I/O connection, since multiple I/O paths can be used to connect
the array nodes.

The invention, therefore, is very adaptable to various network architectures and provides improvements
in network storage performance. This is due to the compute power which is available independent of host
system application load. The invention is also intended to improve the incremental capacity and the reliability of
computer networks.

It is important to note that network 14 can be a generalized switching arrangement that would provide a 
multitude of paths to the individual storage devices coupled to the network. 

Listed below are examples to show the execution of the exemplary operations by the storage system ac- 
cording to the present invention. 



Operation Number   Host   Primary Node   Secondary Node   Operation
1                  H0     (0,1)          (1,1)            Write
2                  H1     (1,0)          -                Read
3                  H2     (2,2)          (1,2)            Write

Operation 1:

H0 issues a RAID level 5 write to node (0,1). H0 passes commands and data to node (0,1) processor P
and buffer B1, respectively, over network bus R0 and node bus BUS 1. Node (0,1) processor P decodes the
command and determines a read-modify-write cycle is necessary involving secondary node (1,1). Node (0,1)
processor P issues a read command with node (0,1) identified as the source to node (1,1). The command
is communicated to node (1,1) via bus C1.

Simultaneously, processor P in node (0,1) issues a read to HDA device D in node (0,1) to read old data
from HDA device D into buffer I.

Node processor P in node (1,1) receives the read command via bus C1, interface block I/F 1, and node
bus BUS 1. Node (1,1) processor P decodes the received read command and retrieves read data from HDA
device D into buffer I. Nodes (0,1) and (1,1) complete their respective reads asynchronously. When the reads
are complete, node (0,1) contains new data in buffer B1 and old data in buffer I. Node (1,1) contains old parity
in its buffer I. Node (1,1) informs node (0,1) that old parity data is in its buffer. Node (0,1) reads the old parity data
over the column bus into node (0,1) buffer B2. Node (0,1) now has new data, old data and old parity in its buffers.

To complete the RAID 5 write operation, node (0,1) processor P orders an exclusive-OR of the data stored
within buffers B1, B2 and I to generate the new parity data. The new parity is placed in buffer I and readied
for transmission to node (1,1) for parity update. Simultaneously, node (0,1) writes the new data from buffer
B1 to buffer B3 for writing to storage device D. Node (0,1) issues a normal write command of new parity from
buffer I to node (1,1).

Node (1,1) informs node (0,1) that the parity write is complete and, in turn, node (0,1), when it completes the
write of new data, informs host processor H0 that the RAID level 5 write is complete.
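
The parity update at the heart of Operation 1 is the usual RAID level 5 read-modify-write rule: new parity = new data XOR old data XOR old parity. The following Python sketch illustrates that exclusive-OR step only; the function names are hypothetical, and byte strings stand in for the contents of buffers B1 (new data), I (old data) and B2 (old parity).

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise exclusive-OR of equally sized blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def new_parity(new_data: bytes, old_data: bytes, old_parity: bytes) -> bytes:
    """Parity after a read-modify-write: new data XOR old data XOR old parity."""
    return xor_blocks(new_data, old_data, old_parity)


if __name__ == "__main__":
    # Illustrative two-disk stripe: parity protects d0 and d1.
    d0, d1 = b"\x0f\xf0", b"\x33\x55"
    parity = xor_blocks(d0, d1)
    n0 = b"\xaa\xbb"  # new contents for d0
    updated = new_parity(n0, d0, parity)
    # The updated parity equals the parity recomputed from the full stripe.
    assert updated == xor_blocks(n0, d1)
```

The check at the end shows why only the primary node and the parity node need to take part: the other data blocks of the stripe never have to be read.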

Operation 2: 

Host processor H1 issues a normal read to node (1,0) over row bus R1. Upon completion of the read, node
(1,0) reports over bus R1 to processor H1 that the operation has completed.

Operation 3: 

Operation 3 occurs identically to operation 1, except that commands and data are passed over row bus R2 and
column bus C2, and operation complete messages are reported to host H2 over bus R2.

Operations 1, 2 and 3 may be performed concurrently.
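
One way to see why the three operations can overlap is to list the network busses each one uses and confirm that no bus is shared. The Python sketch below does this; the bus sets are read off the operation descriptions above, and treating disjoint bus usage as the condition for concurrency is an assumption made for illustration, not a statement from the specification.

```python
from itertools import combinations
from typing import Dict, FrozenSet

# Network busses used by each exemplary operation (from the descriptions above).
BUSSES_USED: Dict[int, FrozenSet[str]] = {
    1: frozenset({"R0", "C1"}),  # H0 -> node (0,1), parity at node (1,1)
    2: frozenset({"R1"}),        # H1 -> node (1,0), no secondary node
    3: frozenset({"R2", "C2"}),  # H2 -> node (2,2), parity at node (1,2)
}


def conflict_free(usage: Dict[int, FrozenSet[str]]) -> bool:
    """True if no two operations share a network bus."""
    return all(a.isdisjoint(b) for a, b in combinations(usage.values(), 2))


if __name__ == "__main__":
    print(conflict_free(BUSSES_USED))
```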

As shown by the examples described above, the architecture enables multiple concurrent operations that
distribute the RAID algorithms over the array of nodes. The nodes act as peers and operate in a dynamic
client/server mode. This invention facilitates expansion of nodes in both row and column directions. Such
expansion permits improvement in performance and capacity without impacting the host processor performance.

The node operation is generalized and could be implemented so that each node can operate in a primary
or secondary mode and communicate over a multiplicity of channels.
It can thus be seen that there has been provided by the present invention a data storage system which
provides increased performance beyond the capabilities of a host system managed storage system or a
storage system employing a single hardware controller. The system described above permits the execution of
multiple storage operations concurrently, each operation being coordinated by a different node within the storage
network.

This architecture is scalable by design and may be expanded by the addition of nodes in both the row and
column direction. In addition, the architecture is not limited to use with magnetic disk drive devices. It can be
used to provide RAID technology on sequential access devices (e.g., QIC tapes, DAT tapes, etc.) as well as
other direct access devices (e.g., optical disks and media changers) and robotic media changer storage
devices. The system can be connected to a single host processor or may be interconnected with several host
processors within a multiple processor computer system.



Claims 

1. A data storage system including a network of nodes, interconnected to communicate with each other via
a plurality of busses (R0-Rm; C0-Cn), characterized in that each one of said nodes includes: a data storage
device (D) connected to receive data from and provide data to at least one of said plurality of busses
(R0-Rm; C0-Cn); and a node processor (P) connected to said at least one of said busses (R0-Rm; C0-Cn) for
controlling the storage and retrieval of data at said data storage device (D), said node processor (P) being
capable of controlling the storage and retrieval of data at data storage devices (D) associated with
additional nodes of said plurality of nodes through communications between said node processor (P) and said
additional nodes via said plurality of busses (R0-Rm; C0-Cn).

2. A data storage system according to claim 1, characterized in that said plurality of busses includes a
plurality of first busses (R0-Rm) adapted to transmit data from and to a corresponding plurality of host system
processors (H0-Hm); and a plurality of second busses (C0-Cn), each one of said plurality of second busses
(C0-Cn) intersecting with each one of said plurality of first busses (R0-Rm); said nodes being located at
the intersections of said first busses (R0-Rm) with said second busses (C0-Cn).

3. A data storage system according to claim 2, characterized in that within each one of said nodes: said data
storage device (D) is connected to receive data from and provide data to the first and the second busses
(e.g. Rm, Cn) associated with said one of said nodes; and in that said node processor (P) is connected to
said first and second busses (Rm, Cn) associated with said one of said nodes for controlling the storage
and retrieval of data at said data storage device (D).

4. A data storage system according to claim 3, characterized in that each one of said nodes further includes
a first buffer (B1) connected between said first and second busses (Rm, Cn) associated with said one of
said nodes, whereby data transmissions between said first and second busses (Rm, Cn) associated with said
one of said nodes are conducted through said first buffer (B1).

5. A data storage system according to claim 4, characterized in that each one of said nodes further includes:
a second buffer (B3) connected between said first and second busses (Rm, Cn) associated with said one
of said nodes, said second buffer (B3) being connected to said storage device (D), whereby data
transmission between said data storage device (D) and said first and second busses (Rm, Cn) associated with
said one of said nodes is effected through said second buffer (B3).

6. A data storage system according to claim 5, characterized in that said data storage device (D) includes 
a magnetic disk drive. 

7. A method for transferring data between a host processor (Hm) and a matrix of data storage nodes, each
node including a data storage device (D) and control logic (P) for coordinating data storage operations
for a plurality of data storage nodes, characterized by the step of selecting any one of said data storage
nodes to control the transfer of data between said host processor (H0-Hm) and a first subset of said matrix
of data storage nodes.

8. A method according to claim 7, characterized by the step of selecting a second one of said data storage
nodes to control the transfer of data between said host processor (Hm) and a second subset of said matrix
of data storage nodes, whereby the transfer of data between said host processor (Hm) and said first subset,
and the transfer between said host processor (Hm) and said second subset, may be performed concurrently.